{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M1.04\n",
    "\n",
    "The goal of this exercise is to evaluate the impact of using an arbitrary\n",
    "integer encoding for categorical variables along with a linear classification\n",
    "model such as Logistic Regression.\n",
    "\n",
    "To do so, let's try to use `OrdinalEncoder` to preprocess the categorical\n",
    "variables. This preprocessor is assembled in a pipeline with\n",
    "`LogisticRegression`. The generalization performance of the pipeline can be\n",
    "evaluated by cross-validation and then compared to the score obtained when\n",
    "using `OneHotEncoder` or to some other baseline score.\n",
    "\n",
    "First, we load the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=[target_name, \"education-num\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous notebook, we used `sklearn.compose.make_column_selector` to\n",
    "automatically select columns with a specific data type (also called `dtype`).\n",
    "Here, we use this selector to get only the columns containing strings (column\n",
    "with `object` dtype) that correspond to categorical features in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector as selector\n",
    "\n",
    "categorical_columns_selector = selector(dtype_include=object)\n",
    "categorical_columns = categorical_columns_selector(data)\n",
    "data_categorical = data[categorical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a\n",
    "`LogisticRegression` classifier.\n",
    "\n",
    "Because `OrdinalEncoder` can raise errors if it sees an unknown category at\n",
    "prediction time, you can set the `handle_unknown=\"use_encoded_value\"` and\n",
    "`unknown_value` parameters. You can refer to the [scikit-learn\n",
    "documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)\n",
    "for more details regarding these parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your model is now defined. Evaluate it using a cross-validation using\n",
    "`sklearn.model_selection.cross_validate`.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">Be aware that if an error happened during the cross-validation,\n",
    "<tt class=\"docutils literal\">cross_validate</tt> would raise a warning and return NaN (Not a Number) as scores.\n",
    "To make it raise a standard Python exception with a traceback, you can pass\n",
    "the <tt class=\"docutils literal\"><span class=\"pre\">error_score=\"raise\"</span></tt> argument in the call to <tt class=\"docutils literal\">cross_validate</tt>. An\n",
    "exception would be raised instead of a warning at the first encountered problem\n",
    "and <tt class=\"docutils literal\">cross_validate</tt> would stop right away instead of returning NaN values.\n",
    "This is particularly handy when developing complex machine learning pipelines.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we would like to compare the generalization performance of our previous\n",
    "model with a new model where instead of using an `OrdinalEncoder`, we use a\n",
    "`OneHotEncoder`. Repeat the model evaluation using cross-validation. Compare\n",
    "the score of both models and conclude on the impact of choosing a specific\n",
    "encoding strategy when using a linear model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}